hugo@dme.ufrj.br

\[g(\mathbf{x}) = \frac{1}{k}\sum_{i \in \mathcal{N}_{\mathbf{x}}^{(k)}} y_i,\] where \(\mathcal{N}_{\mathbf{x}}^{(k)}\) is the set of indices of the \(k\) observations closest to \(\mathbf{x}\), that is, \[\mathcal{N}_{\mathbf{x}}^{(k)} = \left\{ i \in \{1, \dots, n\} \,\middle|\, d(\mathbf{x}_i, \mathbf{x}) \leq d_{\mathbf{x}}^k \right\},\] and \(d_{\mathbf{x}}^k\) denotes the distance from the \(k\)-th nearest neighbor of \(\mathbf{x}\) to \(\mathbf{x}\).
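The estimator above can be written in a few lines; a minimal NumPy sketch for one-dimensional inputs (the toy data here is purely illustrative):

```python
import numpy as np

def knn_predict(x_train, y_train, x, k):
    """Average of the y-values of the k training points closest to x."""
    dist = np.abs(x_train - x)        # d(x_i, x) for 1-D inputs
    nearest = np.argsort(dist)[:k]    # indices of the k nearest neighbors
    return y_train[nearest].mean()    # g(x) = (1/k) * sum of their y_i

x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
print(knn_predict(x_train, y_train, 1.1, k=2))  # neighbors at 1.0 and 2.0 -> 1.5
```

Note that `np.argsort` breaks distance ties arbitrarily, whereas the set definition with \(d_{\mathbf{x}}^k\) may contain more than \(k\) indices when ties occur.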
\(X \sim \mathcal{U}[0, 4]\)
\(f(x) = \sin(\pi x)\)
\(Y = f(X) + \varepsilon\)
\(\varepsilon \sim \mathcal{N}(0, 0.4)\)
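As a sketch, the training sample used below can be generated like this (the names x_tr and y_tr and the size 400 are taken from the code that follows; reading 0.4 as the variance of \(\varepsilon\), and the seed, are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed

n = 400                               # matches the (400,) shapes printed below
x_tr = rng.uniform(0, 4, n)           # X ~ U[0, 4]
eps = rng.normal(0, np.sqrt(0.4), n)  # reading 0.4 as Var(eps)
y_tr = np.sin(np.pi * x_tr) + eps     # Y = f(X) + eps, with f(x) = sin(pi x)
```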
scikit-learn - KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor

KNN = KNeighborsRegressor(n_neighbors = 1)
KNN.fit(x_tr.reshape(-1, 1), y_tr)
print('Dimensions without reshape:', x_tr.shape)
print('Dimensions with reshape:', x_tr.reshape(-1, 1).shape)

Dimensions without reshape: (400,)
Dimensions with reshape: (400, 1)
KNN = KNeighborsRegressor(n_neighbors = 5)
KNN.fit(x_tr.reshape(-1, 1), y_tr)
KNN = KNeighborsRegressor(n_neighbors = 50)
KNN.fit(x_tr.reshape(-1, 1), y_tr)
KNN = KNeighborsRegressor(n_neighbors = 150)
KNN.fit(x_tr.reshape(-1, 1), y_tr)
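The effect of \(k\) shows up in the training error: \(k = 1\) interpolates the training points, while very large \(k\) oversmooths. A self-contained sketch under the simulation model above (the data is regenerated here with an arbitrary seed):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 4, 400)
y_tr = np.sin(np.pi * x_tr) + rng.normal(0, np.sqrt(0.4), 400)
X_tr = x_tr.reshape(-1, 1)

mses = {}
for k in [1, 5, 50, 150]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    mses[k] = mean_squared_error(y_tr, knn.predict(X_tr))
    print(f'k = {k:3d}  training MSE = {mses[k]:.3f}')
```

A near-zero training error for k = 1 does not mean good generalization, which is why \(k\) is chosen by cross-validation.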
Finding the best \(k\) by cross-validation
from sklearn.model_selection import GridSearchCV

KNN = KNeighborsRegressor()
param_grid_KNN = {"n_neighbors": range(1, 31)}
KNNCV = GridSearchCV(KNN, param_grid = param_grid_KNN,
                     scoring='neg_mean_squared_error', cv = 5)
KNNCV.fit(x_tr.reshape(-1, 1), y_tr)
print(KNNCV.best_estimator_)

KNeighborsRegressor(n_neighbors=17)
Optimizing other hyperparameters by cross-validation
KNN = KNeighborsRegressor()
param_grid_KNN = {"n_neighbors": range(1, 31),
                  "weights": ['uniform', 'distance'],
                  "p": range(1, 5)}
KNNCV = GridSearchCV(KNN, param_grid = param_grid_KNN,
                     scoring='neg_mean_squared_error', cv = 5)
KNNCV.fit(x_tr.reshape(-1, 1), y_tr)
print(KNNCV.best_estimator_, "Weights:", KNNCV.best_estimator_.weights)

KNeighborsRegressor(n_neighbors=17, p=1) Weights: uniform
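Besides best_estimator_, GridSearchCV also exposes best_params_ and best_score_ (the latter is the negative MSE under this scoring). A self-contained sketch of the same search on data regenerated under the simulation model (the seed, and hence the exact results, are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 4, 400)
y_tr = np.sin(np.pi * x_tr) + rng.normal(0, np.sqrt(0.4), 400)

param_grid_KNN = {"n_neighbors": range(1, 31),
                  "weights": ['uniform', 'distance'],
                  "p": range(1, 5)}
KNNCV = GridSearchCV(KNeighborsRegressor(), param_grid_KNN,
                     scoring='neg_mean_squared_error', cv=5)
KNNCV.fit(x_tr.reshape(-1, 1), y_tr)

print(KNNCV.best_params_)    # chosen values for all three hyperparameters
print(-KNNCV.best_score_)    # cross-validated MSE (sign flipped back)
```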
Other hyperparameters of KNeighborsRegressor, such as algorithm and leaf_size, control how the neighbor search is carried out rather than which neighbors are found.

“Being able to sense simultaneously thousands of variables on each ‘individual’ sounds like good news: Potentially we will be able to scan every variable that may influence the phenomenon under study. The statistical reality unfortunately clashes with this optimistic statement: Separating the signal from the noise is in general almost impossible in high-dimensional data. This phenomenon […] is often called the ‘curse of dimensionality’.”
“The impact of high dimensionality on statistics is multiple. First, high-dimensional spaces are vast and data points are isolated in their immensity. Second, the accumulation of small fluctuations in many different directions can produce a large global fluctuation. Third, an event that is an accumulation of rare events may not be rare. Finally, numerical computations and optimizations in high-dimensional spaces can be overly intensive.”
Source: Christophe Giraud, *Introduction to High-Dimensional Statistics*
| Dimension (p) | Coefficient of variation |
|---|---|
| 1 | 0.706066 |
| 2 | 0.479571 |
| 3 | 0.365652 |
| 4 | 0.319981 |
| 5 | 0.288417 |
| 6 | 0.254690 |
| 7 | 0.234621 |
| 8 | 0.215403 |
| 9 | 0.209124 |
| 10 | 0.189904 |
| 100 | 0.059053 |
| 1000 | 0.019566 |
| 10000 | 0.006003 |
| 100000 | 0.001834 |
| 1000000 | 0.000581 |
| Dimension (p) | Standard deviation | Mean |
|---|---|---|
| 1 | 0.235095 | 0.332964 |
| 2 | 0.234640 | 0.489271 |
| 3 | 0.253379 | 0.692950 |
| 4 | 0.246943 | 0.771743 |
| 5 | 0.250732 | 0.869337 |
| 6 | 0.252654 | 0.992007 |
| 7 | 0.249233 | 1.062280 |
| 8 | 0.242561 | 1.126082 |
| 9 | 0.247946 | 1.185640 |
| 10 | 0.244178 | 1.285802 |
| 100 | 0.241119 | 4.083056 |
| 1000 | 0.252474 | 12.903961 |
| 10000 | 0.244872 | 40.789564 |
| 100000 | 0.236771 | 129.097221 |
| 1000000 | 0.237349 | 408.246510 |
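Results like those in the two tables above can be reproduced by simulating pairs of independent uniform points in \([0,1]^p\) and looking at the distribution of their Euclidean distance (the sampling design, sample size, and seed are assumptions, so the exact figures differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_stats(p, n_pairs=2000):
    """Mean, std and coefficient of variation of the distance between
    two independent uniform points in [0, 1]^p."""
    a = rng.random((n_pairs, p))
    b = rng.random((n_pairs, p))
    d = np.linalg.norm(a - b, axis=1)
    return d.mean(), d.std(), d.std() / d.mean()

for p in [1, 10, 100, 10000]:
    mean, std, cv = distance_stats(p)
    # the mean grows like sqrt(p/6) while the std stays near 0.24,
    # so the coefficient of variation shrinks: distances concentrate
    print(f'p = {p:5d}  mean = {mean:8.3f}  std = {std:.3f}  cv = {cv:.4f}')
```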
| Dimension (p) | Min. distance / mean distance |
|---|---|
| 1 | 0.001487 |
| 5 | 0.187291 |
| 10 | 0.364847 |
| 50 | 0.776107 |
| 100 | 0.813959 |
| 500 | 0.909649 |
| 1000 | 0.942657 |
| 5000 | 0.974147 |
| 10000 | 0.980400 |
| 50000 | 0.992058 |
| 100000 | 0.993582 |
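The ratio in the last table can be reproduced by comparing, for a random query point, the distance to its nearest neighbor with the average distance to a random sample (the sample size and seed are assumptions; in high dimension the nearest point is barely closer than a typical one):

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_over_mean(p, n_points=200):
    """Ratio of the nearest-neighbor distance to the average distance,
    from a uniform query point to a uniform sample in [0, 1]^p."""
    x = rng.random(p)                   # query point
    sample = rng.random((n_points, p))  # reference sample
    d = np.linalg.norm(sample - x, axis=1)
    return d.min() / d.mean()

for p in [1, 10, 100, 10000]:
    print(f'p = {p:5d}  min dist / mean dist = {nearest_over_mean(p):.4f}')
```

As the ratio approaches 1, "nearest" loses its meaning, which is exactly what hurts k-NN in high dimension.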